Joins for Hybrid Warehouses: Exploiting Massive Parallelism in Hadoop and Enterprise Data Warehouses
Abstract
HDFS has become an important data repository in the enterprise as the center for all business analytics, from SQL queries and machine learning to reporting. At the same time, enterprise data warehouses (EDWs) continue to support critical business analytics. This has created the need for a new generation of special federation between Hadoop-like big data platforms and EDWs, which we call the hybrid warehouse. Many applications require correlating data stored in HDFS with EDW data, for example an analysis that associates click logs stored in HDFS with sales data stored in the database. All existing solutions reach out to HDFS and read the data into the EDW to perform the joins, assuming that the Hadoop side lacks efficient SQL support. In this paper, we show that it is actually better to do most data processing on the HDFS side, provided that we can leverage a sophisticated execution engine for joins on the Hadoop side. We identify the best hybrid warehouse architecture by studying various algorithms to join database and HDFS tables. We use Bloom filters to minimize data movement, and exploit the massive parallelism in both systems to the fullest extent possible. We describe a new zigzag join algorithm, and show that it is a robust join algorithm for hybrid warehouses that performs well in almost all cases.
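The abstract's point about using Bloom filters to cut data movement can be illustrated with a minimal single-process sketch. This is not the paper's zigzag join, whose steps the abstract does not spell out. It only shows the general idea: one side summarizes its join keys in a Bloom filter, and the other side drops rows that cannot match before any data is shipped. The names `BloomFilter`, `bloom_filtered_join`, `db_rows`, and `hdfs_rows`, as well as the filter size and hash count, are hypothetical choices made for this illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: num_hashes positions per key over num_bits slots."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per slot, for clarity

    def _positions(self, key):
        # Derive several slot positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


def bloom_filtered_join(db_rows, hdfs_rows, key="id"):
    """Join two lists of dicts, pruning hdfs_rows with a Bloom filter first.

    Returns (joined_rows, number_of_hdfs_rows_surviving_the_filter).
    """
    # Step 1 (database side): summarize the join keys in a Bloom filter.
    bloom = BloomFilter()
    for row in db_rows:
        bloom.add(row[key])

    # Step 2 (HDFS side): drop rows whose key cannot appear in db_rows.
    survivors = [row for row in hdfs_rows if bloom.might_contain(row[key])]

    # Step 3: exact hash join; this also discards any false positives.
    index = {}
    for row in db_rows:
        index.setdefault(row[key], []).append(row)
    joined = [{**d, **h} for h in survivors for d in index.get(h[key], [])]
    return joined, len(survivors)


if __name__ == "__main__":
    sales = [{"id": 1, "sale": 10}, {"id": 2, "sale": 20}]
    clicks = [{"id": 1, "click": "a"}, {"id": 3, "click": "b"}]
    rows, kept = bloom_filtered_join(sales, clicks)
    print(rows, kept)
```

Because a Bloom filter never yields false negatives, the pruning step cannot lose matching rows. False positives only mean that a few extra rows survive until the exact join removes them.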
Similar resources
A Data Pre-partitioning and Distribution Optimization Approach for Distributed Data Warehouses
The increasing volumes of relational data call for alternatives to cope with them. The Hadoop framework, an open source project based on the MapReduce paradigm, is a popular choice for distributed data warehouses and big data analytics. In this paper, we propose an original approach for partitioning and collocating data in distributed file systems, especially Hadoop-based systems, and this, ...
Building Data Warehouses Using the Enterprise Modeling Framework
This paper proposes an enterprise modeling framework for the deployment of data warehouses. The framework provides the information roadmap coordinating source data and different data warehouses across the business enterprise. The paper introduces a solution to address data warehousing issues at the enterprise level while avoiding the pitfalls of creating enterprise data warehouses and universal...
Automatic Workload Management for Enterprise Data Warehouses
Modern enterprise data warehouses have complex workloads that are notoriously difficult to manage. Additionally, RDBMSs have many “knobs” for managing workloads efficiently. These knobs affect the performance of query workloads in complex interrelated ways and require expert manual attention to change. It often takes a long time for a performance expert to get enough experience with a large war...
Data Mining for Intelligent Enterprise Resource Planning System
Enterprise Resource Planning or ERP is the practice of consolidating an enterprise’s planning, manufacturing, sales and marketing efforts into one management system. It attempts to integrate all departments and functions across a company onto a single computer system that can serve all those different departments' particular needs. This paper proposed an intelligent ERP system by integrating en...
Persistence in Enterprise Data Warehouses
Yet, persistence of redundant data in Data Warehouses is often simply justified with an achievement of better performance when accessing data for analysis and reporting. Especially in Enterprise Data Warehouse systems, data management via multiple persistence levels is necessary to condition the huge amount of data into an adequate format for its final usage. However, there are further reasons ...